Assessing the role of evolutionary information for enhancing protein language model embeddings

Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested several approaches to explicitly incorporate evolutionary information into embeddings across a range of protein prediction tasks. While older pLMs (SeqVec, ProtBert) improved significantly through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based methods outperformed MSA-based methods, and combining both even decreased performance for some (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.

Table S3: Q3 and SOV performances on validation set for different models.
Table S4: MCC performances on validation set for different models.

Conservation Prediction
We tested two sets of methods to compute residue conservation within protein families. The first set, including ConSeq 4 and MMseqs2 3, reads conservation directly from the multiple sequence alignment (MSA) describing the family. The second set, including VESPA 5 and our newly introduced VESPA MSACons, predicts conservation from embeddings. Initially, the reported performance of the default, embedding-only version of VESPA correlated more strongly with ConSeq than any alignment-informed version we tested, i.e., any version trying to boost embeddings by explicitly adding MSA information (Fig. S5). For predicting per-residue conservation in a family (as derived from ConSurf 6), we were surprised that integrating MSA-based information appeared to decrease performance compared to the reported values, since the ground truth, conservation, is explicitly encoded in the MSA. However, we were unable to reproduce the reported performance when rerunning VESPA on the sequences of the ConSurf10k test set. Our re-evaluation showed that VESPA's performance was significantly lower than published, reaching a level similar to MSACons with MMseqs2 alignments. Furthermore, when using the original MAFFT 7 alignments from ConSurfDB for our MSACons approach, we observed statistically significant (95% confidence interval, CI: ±1.96 standard errors) improvements over our re-evaluation of VESPA (Fig. S5), i.e., MSACons with MAFFT MSAs even surpassed VESPA. We hypothesize that this improvement stems from the consistency between the MSAs used to generate the labels and those used in our MSACons approach.
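For context, per-residue conservation can be read from an MSA column, for instance via normalized Shannon entropy binned into nine grades. The sketch below is an illustration under that assumption, not the exact procedure used by ConSeq, ConSurf, or MSACons; the helper names and toy data are ours.

```python
from collections import Counter
import math

# Minimal sketch (illustration only, not the ConSeq/ConSurf/MSACons procedure):
# conservation of an MSA column estimated as 1 - normalized Shannon entropy of
# its amino-acid distribution, then binned into nine grades (1 = variable,
# 9 = conserved), mirroring ConSurf-style conservation grades.

def column_conservation(column: str) -> float:
    """Return conservation in [0, 1] for one MSA column (gaps ignored)."""
    residues = [aa for aa in column.upper() if aa.isalpha()]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(20)  # 20 amino acids set the maximum

def msa_conservation_grades(msa: list[str]) -> list[int]:
    """Map each alignment column to a conservation grade in 1..9."""
    grades = []
    for i in range(len(msa[0])):
        cons = column_conservation("".join(seq[i] for seq in msa))
        grades.append(1 + min(8, int(cons * 9)))
    return grades

# Toy example: three aligned sequences, five columns.
msa = ["ACDEF", "ACDKF", "ACNEF"]
print(msa_conservation_grades(msa))
```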

Limitations of Family Size Analysis
A major obstacle in comprehensively establishing whether proteins from small families (few proteins in the MSA) are predicted differently from those in large families is twofold. Firstly, very few high-resolution experiments target proteins from small families 8. This implies that differences will likely not be statistically significant.
Secondly, and more importantly, smaller families are likely to differ in their biophysical, dynamical, and functional characteristics from larger families 9. The simple argument of the theoretical biophysicist Alyosha Finkelstein for this assumption can be sketched as follows (using the term fold loosely to describe the main 3D scaffold conserved between two proteins with diverged sequences): not all folds are equally likely to be realized and stabilized; thus, some folds are more likely to occur than others. In other words, these folds are fitter, which explains why some folds are more populated (larger families) than others. If true, we expect proteins from larger families to have different biophysical features than those from smaller families. Thus, even if we could collect enough samples in the future, these would still not be representative, because we would be comparing apples and oranges.

Limitations of Training pLMs without Evolutionary Information
That removing all evolutionary information from pLM training is currently impossible can be concluded from a simple numbers game: explicitly adding MSAs failed to improve predictions only for pLMs such as ProtT5 2, trained on over 2x10^9 sequences from BFD 10; the entire UniProt 11, with about ten times fewer sequences (0.2x10^9), did not suffice (ProtBert, Fig. 1). While a data reduction by a factor of ten already makes it impossible to narrow down which information pLMs capture, training pLMs only on non-redundant sequences (e.g., at less than 20% pairwise sequence identity) would more likely reduce the data by a factor of 100-1000 than by ten. This means that until we have databases with 100-1000 times more sequences, this remains completely impossible. Even though sequence databases by far outgrow the fastest-growing driver of modern technology, namely computer chips, at the current rate this would still require 10-20 years.
Additionally, we know that pLMs such as ProtT5 resolve more frequently occurring amino acids more accurately than less frequent ones (M Heinzinger, TUM, unpublished, and 5). However, the frequencies of the most frequent (Leucine, ~10%) and the least frequent amino acid (Cysteine, ~1.5%) differ by less than a factor of ten. In contrast, finding two proteins from the same family in the ocean of unrelated pairs is a much rarer event, as evident from the following back-of-the-envelope calculation: the largest families have about 100k (100,000) proteins 12 (as an average, this is a gross overestimate, because already the 100th largest family is over ten times smaller). Assume this to be the average for all families. When feeding BFD with its 2x10^9 sequences into pLMs, a pair from the same family would, on average, occur only every 20,000th time (2x10^9 / 10^5), i.e., the imbalance between positives and negatives would be about 2,000 times larger for same-family/different-family than for frequent/rare amino acids. While this numbers argument is, strictly speaking, no proof, it illustrates the magnitude of the problem.
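The arithmetic behind this estimate can be spelled out explicitly; the sketch below only restates the rough figures quoted above and is not based on exact database statistics.

```python
# Back-of-the-envelope calculation from the paragraph above; all numbers are
# the rough estimates quoted in the text, not exact database counts.

database_size = 2e9        # sequences in BFD fed into the pLM
family_size = 1e5          # assumed (over)estimated average family size
aa_freq_ratio_bound = 10   # Leucine (~10%) vs. Cysteine (~1.5%): less than 10x

# Chance that a randomly drawn sequence belongs to the same family as a query:
same_family_rate = family_size / database_size   # 1 / 20,000

pair_spacing = 1 / same_family_rate               # every 20,000th sequence
imbalance = pair_spacing / aa_freq_ratio_bound    # ~2,000x larger imbalance

print(f"same-family pair roughly every {pair_spacing:,.0f}th sequence")
print(f"imbalance vs. amino-acid frequencies: ~{imbalance:,.0f}x")
```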

Limitations of biasing pLM training towards Families
One idea for biasing pLM training towards families could be to systematically pick all family members in one batch, i.e., by not selecting the next training protein at random (a possible batching scheme is sketched below). This proposition exceeds our computing resources, because we simply have no funding for retraining a foundation pLM, in particular given the limited chance of success. Furthermore, it remains unclear how informative a negative result would be: interfering with random sampling at such an extreme level might do much damage, as we learned when first trying to use a simple neural network for secondary structure prediction and choosing samples one protein at a time rather than randomly 13. Thus, if such an approach did not improve performance, the lack of improvement might be attributed to "incorrect sampling," once again evading the answer to our question of correlation versus capture. Conversely, if biased family sampling improved performance, we would gain insights. However, we consider this so unlikely that we would hesitate to invest substantial resources toward this end even if we had them.
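Purely for illustration, such family-biased batching could look roughly like the sketch below; the family labels (e.g., from MMseqs2 clustering), the batch size, and the helper are hypothetical, and this is not an experiment we ran.

```python
import random
from collections import defaultdict

# Hypothetical sketch of family-biased batching: instead of drawing training
# proteins at random, fill each batch with members of the same family
# (family labels could, e.g., come from MMseqs2 clustering).

def family_batches(seq_ids, family_of, batch_size=32, seed=0):
    """Yield batches in which all sequences share one family."""
    rng = random.Random(seed)
    families = defaultdict(list)
    for sid in seq_ids:
        families[family_of[sid]].append(sid)
    family_list = list(families.values())
    rng.shuffle(family_list)                 # randomize family order only
    for members in family_list:
        rng.shuffle(members)
        for i in range(0, len(members), batch_size):
            yield members[i:i + batch_size]

# Toy usage with made-up identifiers and family labels.
ids = [f"seq{i}" for i in range(10)]
fam = {sid: ("famA" if i < 6 else "famB") for i, sid in enumerate(ids)}
for batch in family_batches(ids, fam, batch_size=4):
    print(batch)
```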

MSAConsensus predictions for SETH, VESPA, bindEmbed21DL and TMbed
Analogous to the MSAConsensus predictions for our own method, we created MSAs with MMseqs2 3 for all relevant test sets (ConSurf10k 5 for VESPA 5, TestSet225 14 for bindEmbed21DL 14, 57 β-TMPs 15 and 571 α-TMPs 15 for TMbed 15, CheZOD117 16 for SETH 17). We used the method of interest to generate predictions for each sequence in the MSA. The per-residue predictions for each aligned sequence were then mapped via the alignment to the position of the corresponding residue in the query sequence. For SETH, the mean of all predictions mapping to an individual query residue was used as the MSAConsensus prediction. For bindEmbed21DL, a majority vote over 8 classes (non-binding, metal, nucleic acid, small molecule, metal and nucleic acid, metal and small molecule, nucleic acid and small molecule, all three) determined the MSAConsensus prediction; for TMbed, a majority vote over 2 classes (either TM-helix/non-helix or TM-sheet/non-sheet); and for VESPA, a majority vote over 9 classes (conservation grades 1-9).
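As an illustration of this consensus step, the sketch below maps per-residue predictions of aligned sequences onto query positions and combines them by mean or majority vote; the helper name, the simplified gap handling, and the toy data are assumptions, not the exact implementation used in this work.

```python
from collections import Counter
from statistics import mean

# Minimal sketch of an MSAConsensus step (hypothetical helper, simplified gap
# handling). Each aligned sequence has one prediction per residue; predictions
# are mapped via the alignment onto query positions and combined by majority
# vote (classification) or mean (regression, e.g., SETH-style disorder scores).

def msa_consensus(aligned_seqs, predictions, query_idx=0, regression=False):
    """aligned_seqs: gapped sequences of equal length (query at query_idx).
    predictions: per-sequence lists of per-residue predictions (no gaps)."""
    query = aligned_seqs[query_idx]
    # For every sequence, collect its predictions per alignment column.
    per_column = [[] for _ in range(len(query))]
    for seq, preds in zip(aligned_seqs, predictions):
        res_idx = 0
        for col, aa in enumerate(seq):
            if aa != "-":
                per_column[col].append(preds[res_idx])
                res_idx += 1
    consensus = []
    for col, aa in enumerate(query):
        if aa == "-":
            continue  # only report positions present in the query
        values = per_column[col]
        if regression:
            consensus.append(mean(values))                         # mean
        else:
            consensus.append(Counter(values).most_common(1)[0][0])  # majority
    return consensus

# Toy usage: two aligned sequences, binary class predictions per residue.
msa = ["AC-DE", "ACFDE"]
preds = [[1, 1, 0, 0], [1, 0, 1, 0, 0]]
print(msa_consensus(msa, preds))  # consensus over the 4 query residues
```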

Additional Performance Measures
For secondary structure prediction, besides Q3, we evaluated performance through several additional measures. For simplicity, we used the following standard annotations, with x ∈ {H, E, -}: true positives (TP_x) were residues correctly predicted as secondary structure class x; false positives (FP_x) were residues predicted as class x but experimentally annotated as not x; true negatives (TN_x) were residues correctly predicted as not of class x; and false negatives (FN_x) were residues annotated as class x but incorrectly predicted as not x. We calculated the MCC (Eqn. 1) separately for each secondary structure class:

MCC_x = \frac{TP_x \cdot TN_x - FP_x \cdot FN_x}{\sqrt{(TP_x + FP_x)(TP_x + FN_x)(TN_x + FP_x)(TN_x + FN_x)}} (Eqn. 1)

As an additional combined performance measure, we used the segment overlap score (SOV, Eqn. 2 [18][19][20]):

SOV = 100 \cdot \frac{1}{N} \sum_{i \in \{H,E,-\}} \sum_{(s_0, s_1) \in S(i)} \frac{minov(s_0, s_1) + \delta(s_0, s_1)}{maxov(s_0, s_1)} \cdot len(s_0) (Eqn. 2)

Here, s_0 refers to all observed helix, strand, and other segments, and s_1 to all predicted segments. S(i) denotes the set of all overlapping pairs (s_0, s_1) of state i, and N = \sum_i N(i), where N(i) is the sum of the number of elements in S(i) and the number of all observed segments s_0 that are not overlapped by any predicted segment of identical state. The length of a segment s_0 in amino acid residues is given by len(s_0), the length of the actual overlap of s_0 and s_1 in the given state by minov(s_0, s_1), and the extent over which at least one of the two segments is in the given state by maxov(s_0, s_1). δ(s_0, s_1) is defined as:

\delta(s_0, s_1) = \min\{maxov(s_0, s_1) - minov(s_0, s_1);\; minov(s_0, s_1);\; \lfloor len(s_0)/2 \rfloor;\; \lfloor len(s_1)/2 \rfloor\} (Eqn. S3)

MCC performance of SignalP-6.0 37, SignalP-5.0 36 (original and retrained), DEEPSIG 38, and the Random Rate baseline on Eukarya signal peptide prediction on the SignalP-5.0 benchmark dataset 36. The embedding-based SignalP-6.0 and the original SignalP-5.0 achieve similar performances, with at least a numerical improvement for SignalP-6.0. Performances for SignalP-5.0, DEEPSIG, and SignalP-6.0 were obtained from Nallapareddy et al.
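To illustrate the per-class MCC of Eqn. 1 above, the following minimal sketch computes it for three-state secondary structure; the helper name and toy labels are ours, not part of this work's evaluation code.

```python
import math

# Minimal sketch of per-class MCC (Eqn. 1) for three-state secondary structure.
# Input: observed and predicted per-residue states from {'H', 'E', '-'}.

def mcc_per_class(observed, predicted, classes=("H", "E", "-")):
    scores = {}
    for x in classes:
        tp = sum(o == x and p == x for o, p in zip(observed, predicted))
        fp = sum(o != x and p == x for o, p in zip(observed, predicted))
        tn = sum(o != x and p != x for o, p in zip(observed, predicted))
        fn = sum(o == x and p != x for o, p in zip(observed, predicted))
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        scores[x] = (tp * tn - fp * fn) / denom if denom else 0.0
    return scores

# Toy example: ten residues.
obs = list("HHHHEEE---")
pred = list("HHHEEEE--H")
print(mcc_per_class(obs, pred))
```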

Related Work
Random Rate: randomly predict labels by drawing from the given class distribution.

Figure S1: Q3 performances of different model and embedding types on validation set.

Figure S2: Visualization of CNN layer weights in the first layer of PSSMConcat.

Figure S3: Visualization of CNN layer weights in the concatenation layer of PSSMSplit.

Figure S4: Trade-off of conservation thresholds in terms of MCC and Q2.

Figure S6: Sketch of the architecture used for raw embeddings, MSA embeddings, and MSAConsensus.

Table S1: Q3 and SOV performances on test set for different models.
Table S2: MCC performances on test set for different models.

1.1.2 Validation Set
Figure S1: Q3 performances of different model and embedding types on validation set.

MCC performances on the TEST100 dataset are shown. For SeqVec models, the MSA embeddings clearly outperform all other models in all classes. For ProtBert models, a significant improvement for MSA embeddings over raw embeddings can still be observed for sheet predictions; for the helix and other classes, raw embeddings and MSA embeddings perform similarly.
For ProtT5 models, all variants perform similarly, with raw embeddings achieving the numerically highest performance in all three classes. ± values mark the standard error (Eqn. 7). For each column, the numerically highest performances for SeqVec, ProtBert, and ProtT5 are highlighted in bold. Distribution baseline: randomly predict labels by drawing from the given class distribution. Majority class baseline: predict the majority class.

Table S9: Secondary structure Q3 prediction performance.
Q3 performances of Ankh 29, ProtT5-XL-U50 2, NetSurfP-3.0 30, the ProtT5 raw embedding model from this work, the Random Rate baseline, and the ZeroR baseline for secondary structure prediction in 3 classes on the CASP12 dataset 28. Embedding-based methods, such as our ProtT5 raw embeddings model, ProtT5-XL-U50, and Ankh, compete with evolutionary information-based methods such as NetSurfP-2.0. All methods clearly outperform the Random Rate and ZeroR baselines. Performances for NetSurfP-2.0, NetSurfP-3.0, Ankh, and ProtT5-XL-U50 were obtained from Elnaggar et al. and Hoie et al.; the Random Rate and ZeroR baselines for secondary structure were computed in the context of this work. For the Q3 column, significantly best results are highlighted in bold.

Table S10: 3D protein structure prediction TM-scores.
TM-score performances of AlphaFold2 31 and ESMFold 32 for 3D protein structure prediction on the CAMEO 33 and CASP14 34 datasets. The embedding-based ESMFold performs similarly to AlphaFold2, which uses evolutionary information in the form of MSAs during inference, on the CAMEO dataset, but AlphaFold2 outperforms ESMFold on the CASP14 dataset. Performances were obtained from Lin et al.